
    Feature Relevance in Ward’s Hierarchical Clustering Using the Lp Norm

    In this paper we introduce a new hierarchical clustering algorithm called Ward_p. Unlike the original Ward method, Ward_p generates feature weights, which can be seen as feature rescaling factors thanks to the use of the Lp norm. The feature weights are cluster dependent, allowing a feature to have different degrees of relevance at different clusters. We validate our method by performing experiments on a total of 75 real-world and synthetic datasets, with and without added features made of uniformly random noise. Our experiments show that: (i) our feature weighting method produces results superior to those of the original Ward method on datasets containing noise features; (ii) it is indeed possible to estimate a good exponent p under a totally unsupervised framework. The clusterings produced by Ward_p depend on p, which makes the estimation of a good value for this exponent a requirement for this algorithm, and indeed for any other also based on the Lp norm.
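
    The merge criterion behind such an algorithm can be sketched in a few lines of Python. This is a minimal illustration, assuming Ward's usual merge cost with the squared Euclidean distance swapped for a weighted Lp norm, and an inverse-dispersion weight update of the kind used in Minkowski-weighted k-means (p > 1); all function names are ours, not the paper's.

```python
import numpy as np

def feature_weights(cluster, p):
    """Cluster-dependent feature weights from per-feature Lp dispersion around
    the centroid (inverse-dispersion scheme, an assumption; requires p > 1)."""
    disp = (np.abs(cluster - cluster.mean(axis=0)) ** p).sum(axis=0) + 1e-12
    inv = (1.0 / disp) ** (1.0 / (p - 1))
    return inv / inv.sum()

def merge_cost(a, b, p):
    """Ward-style increase-in-dispersion cost, measured with the weighted Lp norm."""
    w = feature_weights(np.vstack([a, b]), p)
    gap = np.abs(a.mean(axis=0) - b.mean(axis=0)) ** p
    return len(a) * len(b) / (len(a) + len(b)) * np.sum((w ** p) * gap)

def ward_p(X, k, p=2.0):
    """Naive O(n^3) agglomeration: start from singletons and greedily merge
    the cheapest pair until k clusters remain."""
    clusters = [X[i:i + 1] for i in range(len(X))]
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]], p))
        clusters[i] = np.vstack((clusters[i], clusters.pop(j)))
    return clusters
```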

    On sum-free subsets of abelian groups

    In this paper we discuss some of the key properties of sum-free subsets of abelian groups. Our discussion has been designed with a broader readership in mind, and is hence not overly technical. We consider, among others, answers to questions like: how many sum-free subsets are there in a given abelian group G? what are its sum-free subsets of maximum cardinality? what is the maximum cardinality of these sum-free subsets? and what does a typical sum-free subset of G look like?
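
    For a concrete feel for these questions, one can enumerate the sum-free subsets of a small cyclic group by brute force. A sketch for Z_n, using the definition that no sum a + b (with a, b in S, a = b allowed) may fall back in S:

```python
from itertools import combinations

def is_sum_free(subset, n):
    """No a + b (mod n), with a = b allowed, may land back in the subset."""
    s = set(subset)
    return all((a + b) % n not in s for a in s for b in s)

def sum_free_subsets(n):
    """All sum-free subsets of Z_n, by exhaustive search (fine for small n)."""
    return [c for r in range(n + 1)
            for c in combinations(range(n), r) if is_sum_free(c, n)]

subsets = sum_free_subsets(10)
print(len(subsets))            # how many sum-free subsets Z_10 has
print(max(map(len, subsets)))  # their maximum cardinality: 5, e.g. the odd residues
```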

    Removing redundant features via clustering: preliminary results in mental task separation

    Recent clustering algorithms have been designed to take into account the degree of relevance of each feature by automatically calculating feature weights. However, because they tend to evaluate one feature at a time, these algorithms may have difficulties dealing with features containing similar information. Should this information be relevant, these algorithms would set high weights to all such features, instead of removing some due to their redundant nature. In this paper we introduce an unsupervised feature selection method that targets redundant features. Our method clusters similar features together and selects a subset of representative features for each cluster. This selection is based on the maximum information compression index between each feature and its respective cluster centroid. We empirically validate our method by comparing it with a popular unsupervised feature selection method on three EEG data sets. We find that ours selects features that produce better cluster recovery, without the need for an extra user-defined parameter.
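
    The maximum information compression index itself is simple to compute: it is the smallest eigenvalue of the covariance matrix of a feature pair (Mitra et al., 2002), and it drops to zero when the two features are linearly dependent, i.e. fully redundant. A minimal sketch:

```python
import numpy as np

def mici(x, y):
    """Maximum information compression index: the smallest eigenvalue of the
    2x2 covariance matrix of (x, y); zero iff the features are linearly dependent."""
    vx, vy = np.var(x), np.var(y)
    r = np.corrcoef(x, y)[0, 1]
    return 0.5 * (vx + vy - np.sqrt((vx + vy) ** 2 - 4 * vx * vy * (1 - r ** 2)))
```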

    Effective Spell Checking Methods Using Clustering Algorithms

    This paper presents a novel approach to spell checking using dictionary clustering. The main goal is to reduce the number of distance calculations needed when finding target words for misspellings. The method is unsupervised and combines anomalous pattern initialisation with Partitioning Around Medoids (PAM). To evaluate the method, we used an English misspelling list compiled from real examples extracted from the Birkbeck spelling error corpus.
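
    The computational saving comes from comparing a misspelling against the cluster medoids first, and only then against the words of the nearest cluster. A rough Python sketch, using a simple Voronoi-style k-medoids over edit distance in place of full PAM and omitting the anomalous pattern initialisation:

```python
import random

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cluster_dictionary(words, k, iters=10, seed=0):
    """Voronoi-style k-medoids over edit distance (a simplification of PAM)."""
    medoids = random.Random(seed).sample(words, k)
    for _ in range(iters):
        # assign every word to its closest medoid
        groups = {m: [] for m in medoids}
        for w in words:
            groups[min(medoids, key=lambda m: levenshtein(w, m))].append(w)
        # move each medoid to the member minimising total within-cluster distance
        medoids = [min(g, key=lambda x: sum(levenshtein(x, y) for y in g))
                   for g in groups.values() if g]
    # final assignment against the settled medoids
    clusters = {m: [] for m in medoids}
    for w in words:
        clusters[min(medoids, key=lambda m: levenshtein(w, m))].append(w)
    return clusters

def suggest(misspelling, clusters):
    """Distance computations drop from |dictionary| to k + |nearest cluster|."""
    nearest = min(clusters, key=lambda m: levenshtein(misspelling, m))
    return min(clusters[nearest], key=lambda w: levenshtein(misspelling, w))
```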

    On Partitional Clustering of Malware

    In this paper we fully describe a novel clustering method for malware, from the transformation of the data into a manipulable standardised data matrix and the estimation of the number of clusters, to the clustering itself, including visualisation of the high-dimensional data. Our clustering method deals well with categorical data and clusters the behavioural data of 17,000 websites, acquired with Capture-HPC, in less than 2 minutes.
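
    One plausible reading of the pipeline's first steps, in Python: one-hot encode the categorical behaviour records, centre each dummy at its frequency, and pick the number of clusters by a generic index (silhouette width here, standing in for whatever criterion the paper actually uses):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def standardise_categorical(df):
    """One-hot encode categorical records, then centre each dummy at its
    frequency (Mirkin-style centring; the paper's exact scaling is not
    reproduced here)."""
    dummies = pd.get_dummies(df).astype(float)
    return (dummies - dummies.mean()).to_numpy()

def cluster_behaviour(df, k_max=10, seed=0):
    """Pick the number of clusters by silhouette width (an assumption) and
    return (score, k, labels) for the best partition found."""
    X = standardise_categorical(df)
    best = None
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best
```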

    A clustering based approach to reduce feature redundancy

    This document is the Accepted Manuscript version of the following paper: Cordeiro de Amorim, R., and Mirkin, B., ‘A clustering based approach to reduce feature redundancy’, in Proceedings, Andrzej M. J. Skulimowski and Janusz Kacprzyk, eds., Knowledge, Information and Creativity Support Systems: Recent Trends, Advances and Solutions, Selected papers from KICSS’2013 - 8th International Conference on Knowledge, Information, and Creativity Support Systems, Kraków, Poland, 7-9 November 2013. ISBN 978-3-319-19089-1, e-ISBN 978-3-319-19090-7. Available online at doi: 10.1007/978-3-319-19090-7. © Springer International Publishing Switzerland 2016.

    Research effort has recently focused on designing feature weighting clustering algorithms. These algorithms automatically calculate the weight of each feature in a data set, representing its degree of relevance. However, since most of them evaluate one feature at a time, they may have difficulties clustering data sets containing features with similar information. If a group of features contains the same relevant information, these clustering algorithms set high weights to each feature in the group, instead of removing some because of their redundant nature. This paper introduces an unsupervised feature selection method that can be used in the data pre-processing step to reduce the number of redundant features in a data set. The method clusters similar features together and then selects a subset of representative features for each cluster. This selection is based on the maximum information compression index between each feature and its respective cluster centroid. We present an empirical validation of our method by comparing it with a popular unsupervised feature selection method on three EEG data sets. We find that our method selects features that produce better cluster recovery, without the need for an extra user-defined parameter.
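
    Putting the pieces together, the selection step might look like the sketch below: cluster the features (columns), then keep per cluster the feature with the smallest MICI to its centroid, reusing the mici helper sketched earlier. k-means over the transposed matrix is our assumption; the paper's own feature-clustering step may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_representative_features(X, n_groups, seed=0):
    """Cluster the columns of X into n_groups, then keep one representative per
    group: the feature whose MICI with its group centroid is smallest, i.e.
    the one the centroid compresses best."""
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=seed).fit(X.T)
    selected = []
    for g in range(n_groups):
        members = np.where(km.labels_ == g)[0]
        centroid = km.cluster_centers_[g]
        selected.append(min(members, key=lambda f: mici(X[:, f], centroid)))
    return sorted(selected)
```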

    Core clustering as a tool for tackling noise in cluster labels

    Real-world data sets often contain mislabelled entities. This can be particularly problematic if the data set is used by a supervised classification algorithm in its learning phase. In that case the accuracy of the classification algorithm, when applied to unlabelled data, is likely to suffer considerably. In this paper we introduce a clustering-based method capable of reducing the number of mislabelled entities in data sets. Our method can be summarised as follows: (i) cluster the data set; (ii) select the entities that have the most potential to be assigned to correct clusters; (iii) use the entities from the previous step to define the core clusters and map them to the labels using a confusion matrix; (iv) use the core clusters and our cluster membership criterion to correct the labels of the remaining entities. We perform numerous experiments to validate our method empirically, using k-nearest neighbour classifiers as a benchmark, on both synthetic and real-world data sets with different proportions of mislabelled entities. Our experiments demonstrate that the proposed method produces promising results. Thus, it could be used as a pre-processing data correction step of a supervised machine learning algorithm.
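
    A hedged sketch of the four steps, assuming k-means for the clustering, distance to the assigned centroid as the confidence criterion, and a Hungarian matching of clusters to labels; the paper's own cluster membership criterion is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

def correct_labels(X, noisy_labels, n_clusters, core_fraction=0.5, seed=0):
    """(i) cluster; (ii) keep the core_fraction of each cluster closest to its
    centroid as the core; (iii) match clusters to labels through the core's
    confusion matrix; (iv) relabel everything by its cluster's matched label.
    Assumes n_clusters equals the number of distinct labels."""
    noisy_labels = np.asarray(noisy_labels)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    core = np.zeros(len(X), dtype=bool)
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        core[idx[np.argsort(dist[idx])[:max(1, int(core_fraction * len(idx)))]]] = True
    # confusion matrix between core cluster assignments and the noisy labels
    labels = np.unique(noisy_labels)
    conf = np.zeros((n_clusters, len(labels)))
    for c, l in zip(km.labels_[core], noisy_labels[core]):
        conf[c, np.searchsorted(labels, l)] += 1
    _, cols = linear_sum_assignment(-conf)  # maximise agreement on the core
    return labels[cols[km.labels_]]
```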